Towards a ‘Science’ of Corpus Annotation: A New Methodological Challenge for Corpus Linguistics

Authors

  • EDUARD HOVY
  • JULIA LAVID
Abstract

Corpus annotation—adding interpretive information into a collection of texts—is valuable for a number of reasons, including the validation of theories of textual phenomena and the creation of corpora upon which automated learning algorithms can be trained. This paper outlines the main challenges posed by human-coded corpus annotation for current corpus linguistic practice, describing some of the methodological steps required for this indispensable part of the research agenda of Corpus Linguistics in this decade. The first part of the paper presents an overview of the methodologies and open questions in corpus annotation as seen from the perspective of the field of Natural Language Processing. This is followed by an analysis of the theoretical and practical impact of corpus annotation in the field of Corpus Linguistics. It is suggested that collaborative efforts are necessary to advance knowledge in both fields, thereby helping to develop the kind of methodological rigour that would bring about a ‘science’ of annotation.

Similar resources

What might a corpus of parsed spoken data tell us about language?

This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable. This perspective unifies ‘corpus-driven’ and ‘theory...

ZT Corpus: Annotation and Tools for Basque Corpora

The ZT Corpus (Basque Corpus of Science and Technology) is a tagged collection of specialised texts in Basque, which aims to be a major resource in research and development with respect to written technical Basque: terminology, syntax and style. It was released in December 2006 and can be queried at http://www.ztcorpusa.net. The ZT Corpus stands out among other Basque corpora for many reasons: ...

The Spoken Dutch Corpus. Overview and First Evaluation

In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, i...

OpenText.org: the problems and prospects of working with ancient discourse

1. Introduction The vast majority of studies in corpus linguistics have focused upon contemporary usage of modern languages. However, although there have been a number of studies of the earlier periods of some of these languages, such as Old English and Old French, they have tended to adopt the methods developed for modern languages. In our theoretical paper (Porter and O'Donnell 2001), we have...

Where Anaphora and Coreference Meet. Annotation in the Spanish CESS-ECE Corpus

This paper describes the guidelines of the annotation scheme designed to enrich the Spanish CESS-ECE corpus with coreference information, which is a significant step towards the definition of an exhaustive typology of pronominal and full NP coreferential expressions and their relations for Spanish. The goal is twofold. From a computational perspective, this work establishes the formal foundatio...

Publication date: 2013